Implement FlashAttention for CPU #20805

duanqn · 2024-05-24T08:27:06Z

Description

Implement FlashAttention and FlashAttention-2 for MultiHeadAttention on CPU.

Motivation and Context

Accelerate the execution of MultiHeadAttention.

Current performance: 10 ms (FlashAttention) vs. 16 ms (existing com.microsoft.MultiHeadAttention) on my Linux machine, and 10 ms vs. 38 ms on my Windows machine. Further optimization may be needed.
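For readers new to the technique, here is a minimal pure-Python sketch of the idea behind FlashAttention: K/V are visited in blocks, and only a running max, a running sum, and an unnormalized output row are kept per query, so the full score matrix is never materialized. This is an illustration only, not the kernel added in this PR. The function names and the `kv_block` parameter are hypothetical, and a real kernel would also block over query rows and use GEMM calls.

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, building every score row in full."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, kv_block=2):
    """Same result, computed block by block over K/V with an online softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        run_max = -math.inf          # largest score seen so far
        run_sum = 0.0                # sum of exp(score - run_max) so far
        acc = [0.0] * len(v[0])      # unnormalized output row
        for start in range(0, len(k), kv_block):
            k_blk = k[start:start + kv_block]
            v_blk = v[start:start + kv_block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on the first block).
            correction = math.exp(run_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding for any `kv_block >= 1`; the tiled version only needs working memory proportional to the block size rather than to the full K/V sequence length.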

onnxruntime/contrib_ops/cpu/bert/flash_attention.cc

onnxruntime/contrib_ops/cpu/bert/flash_attention.h

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc

onnxruntime/core/mlas/inc/mlas_flashattn.h

onnxruntime/test/python/transformers/benchmark_mha.py

duanqn · 2024-06-19T09:26:27Z

Test failing: MultiHeadAttentionTest.CrossAttention_DiffSequenceLengths

Edit: passed

github-advanced-security

PREfast found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

onnxruntime/core/mlas/lib/flashattn.cpp

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc

duanqn · 2024-06-21T10:14:16Z

Environment Variables:
ORT_DISABLE_FLASH_ATTENTION=0

| format | causal | batch | seqlen | heads | h_dim | ms | TFLOPS | kernel |
|---|---|---|---|---|---|---|---|---|
| Q,K,V | False | 1 | 128 | 32 | 128 | 1.59 | 0.17 | CPU:Flash |
| Q,K,V | False | 1 | 256 | 32 | 128 | 2.74 | 0.39 | CPU:Flash |
| Q,K,V | False | 1 | 512 | 32 | 128 | 8.28 | 0.52 | CPU:Flash |
| Q,K,V | False | 1 | 1024 | 32 | 128 | 26.43 | 0.65 | CPU:Flash |
| Q,K,V | False | 1 | 2048 | 32 | 128 | 88.92 | 0.77 | CPU:Flash |
| Q,K,V | False | 1 | 4096 | 8 | 40 | 36.26 | 0.59 | CPU:Flash |
| Q,K,V | False | 1 | 4096 | 8 | 80 | 54.36 | 0.79 | CPU:Flash |
| Q,K,V | False | 1 | 4096 | 8 | 160 | 99.28 | 0.87 | CPU:Flash |
| Q,K,V | False | 4 | 4096 | 8 | 40 | 144.85 | 0.59 | CPU:Flash |
| Q,K,V | False | 4 | 4096 | 8 | 80 | 217.08 | 0.79 | CPU:Flash |
| Q,K,V | False | 4 | 4096 | 8 | 160 | 400.06 | 0.86 | CPU:Flash |
| Q,K,V | False | 1 | 16384 | 8 | 40 | 570.16 | 0.60 | CPU:Flash |
| Q,K,V | False | 1 | 16384 | 8 | 80 | 854.11 | 0.80 | CPU:Flash |
| Q,K,V | False | 1 | 16384 | 8 | 160 | 1511.06 | 0.91 | CPU:Flash |
| Q,K,V | False | 128 | 128 | 12 | 64 | 29.84 | 0.22 | CPU:Flash |
| Q,K,V | False | 64 | 128 | 12 | 64 | 14.82 | 0.22 | CPU:Flash |
| Q,K,V | False | 128 | 384 | 12 | 64 | 131.07 | 0.44 | CPU:Flash |
| Q,K,V | False | 64 | 384 | 12 | 64 | 65.70 | 0.44 | CPU:Flash |
| Q,K,V | False | 128 | 512 | 12 | 64 | 203.86 | 0.51 | CPU:Flash |
| Q,K,V | False | 64 | 512 | 12 | 64 | 99.83 | 0.52 | CPU:Flash |
| Q,K,V | False | 4 | 2048 | 32 | 128 | 350.01 | 0.79 | CPU:Flash |
| Q,K,V | False | 4 | 4096 | 32 | 128 | 1278.42 | 0.86 | CPU:Flash |
| Q,K,V | False | 8 | 2048 | 32 | 128 | 698.98 | 0.79 | CPU:Flash |
| Q,K,V | False | 8 | 4096 | 32 | 128 | 2547.00 | 0.86 | CPU:Flash |
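The two runs in this comment differ only in the value of ORT_DISABLE_FLASH_ATTENTION. A small sketch of toggling it from Python for an A/B comparison is shown here. The helper name `flash_attention_disabled` is hypothetical, and the sketch assumes the variable is read when the session and its kernels are created, so the session should be constructed inside the `with` block.

```python
import os
from contextlib import contextmanager

ENV_NAME = "ORT_DISABLE_FLASH_ATTENTION"


@contextmanager
def flash_attention_disabled(disabled=True):
    """Temporarily set ORT_DISABLE_FLASH_ATTENTION, restoring the old value on exit."""
    previous = os.environ.get(ENV_NAME)
    os.environ[ENV_NAME] = "1" if disabled else "0"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(ENV_NAME, None)
        else:
            os.environ[ENV_NAME] = previous


# Usage (assumption: build the inference session inside the block so that
# the kernel sees the value when it is constructed):
#
# with flash_attention_disabled():
#     ...create the session and run the unfused baseline here...
```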

Environment Variables:
ORT_DISABLE_FLASH_ATTENTION=1

| format | causal | batch | seqlen | heads | h_dim | ms | TFLOPS | kernel |
|---|---|---|---|---|---|---|---|---|
| Q,K,V | False | 1 | 128 | 32 | 128 | 1.43 | 0.19 | CPU:Unfused |
| Q,K,V | False | 1 | 256 | 32 | 128 | 3.24 | 0.33 | CPU:Unfused |
| Q,K,V | False | 1 | 512 | 32 | 128 | 11.26 | 0.38 | CPU:Unfused |
| Q,K,V | False | 1 | 1024 | 32 | 128 | 36.88 | 0.47 | CPU:Unfused |
| Q,K,V | False | 1 | 2048 | 32 | 128 | 106.25 | 0.65 | CPU:Unfused |
| Q,K,V | False | 1 | 4096 | 8 | 40 | 49.43 | 0.43 | CPU:Unfused |
| Q,K,V | False | 1 | 4096 | 8 | 80 | 75.99 | 0.57 | CPU:Unfused |
| Q,K,V | False | 1 | 4096 | 8 | 160 | 137.47 | 0.62 | CPU:Unfused |
| Q,K,V | False | 4 | 4096 | 8 | 40 | 194.25 | 0.44 | CPU:Unfused |
| Q,K,V | False | 4 | 4096 | 8 | 80 | 298.62 | 0.58 | CPU:Unfused |
| Q,K,V | False | 4 | 4096 | 8 | 160 | 540.00 | 0.64 | CPU:Unfused |
| Q,K,V | False | 1 | 16384 | 8 | 40 | 962.66 | 0.36 | CPU:Unfused |
| Q,K,V | False | 1 | 16384 | 8 | 80 | 1389.89 | 0.49 | CPU:Unfused |
| Q,K,V | False | 1 | 16384 | 8 | 160 | 2605.56 | 0.53 | CPU:Unfused |
| Q,K,V | False | 128 | 128 | 12 | 64 | 33.08 | 0.19 | CPU:Unfused |
| Q,K,V | False | 64 | 128 | 12 | 64 | 16.26 | 0.20 | CPU:Unfused |
| Q,K,V | False | 128 | 384 | 12 | 64 | 149.92 | 0.39 | CPU:Unfused |
| Q,K,V | False | 64 | 384 | 12 | 64 | 75.20 | 0.39 | CPU:Unfused |
| Q,K,V | False | 128 | 512 | 12 | 64 | 234.68 | 0.44 | CPU:Unfused |
| Q,K,V | False | 64 | 512 | 12 | 64 | 117.20 | 0.44 | CPU:Unfused |
| Q,K,V | False | 4 | 2048 | 32 | 128 | 409.42 | 0.67 | CPU:Unfused |
| Q,K,V | False | 4 | 4096 | 32 | 128 | 1561.20 | 0.70 | CPU:Unfused |
| Q,K,V | False | 8 | 2048 | 32 | 128 | 814.60 | 0.67 | CPU:Unfused |
| Q,K,V | False | 8 | 4096 | 32 | 128 | 3112.91 | 0.71 | CPU:Unfused |
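To help read the two tables together, the sketch below recomputes the TFLOPS column and the Flash-vs-Unfused speedup for a few rows. The formula (4 * batch * heads * seqlen^2 * h_dim FLOPs for non-causal attention, counting the QK^T and PV matmuls) is an assumption about how the benchmark derives TFLOPS; it does reproduce the reported values for the rows checked here. The function names are hypothetical.

```python
def attention_tflops(batch, seqlen, heads, h_dim, ms):
    """TFLOPS for non-causal attention: two matmuls of 2*seqlen*seqlen*h_dim FLOPs per head."""
    flops = 4 * batch * heads * seqlen * seqlen * h_dim
    return flops / (ms * 1e-3) / 1e12


def speedup(unfused_ms, flash_ms):
    """How many times faster the flash kernel is than the unfused one."""
    return unfused_ms / flash_ms


# (batch, seqlen, heads, h_dim, flash_ms, unfused_ms), copied from the tables above.
ROWS = [
    (1, 128, 32, 128, 1.59, 1.43),
    (1, 2048, 32, 128, 88.92, 106.25),
    (1, 16384, 8, 160, 1511.06, 2605.56),
    (128, 512, 12, 64, 203.86, 234.68),
]

for batch, seqlen, heads, h_dim, flash_ms, unfused_ms in ROWS:
    print(
        f"batch={batch} seqlen={seqlen} heads={heads} h_dim={h_dim}: "
        f"flash {attention_tflops(batch, seqlen, heads, h_dim, flash_ms):.2f} TFLOPS, "
        f"unfused {attention_tflops(batch, seqlen, heads, h_dim, unfused_ms):.2f} TFLOPS, "
        f"speedup {speedup(unfused_ms, flash_ms):.2f}x"
    )
```

On these rows the gain grows with sequence length: the flash kernel is roughly 1.7x faster at seqlen 16384 with h_dim 160, while at seqlen 128 with batch 1 it is slightly slower than the unfused kernel (1.59 ms vs. 1.43 ms).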

tianleiwu · 2024-06-21T21:16:32Z

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

tianleiwu · 2024-06-21T21:16:34Z

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

tianleiwu · 2024-06-21T21:16:35Z

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

azure-pipelines · 2024-06-21T21:16:51Z

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines · 2024-06-21T21:17:08Z

Azure Pipelines successfully started running 10 pipeline(s).

azure-pipelines · 2024-06-21T21:17:09Z

Azure Pipelines successfully started running 10 pipeline(s).

tianleiwu · 2024-07-11T04:19:34Z

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

tianleiwu · 2024-07-11T04:19:42Z

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

tianleiwu · 2024-07-11T04:19:50Z

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

azure-pipelines · 2024-07-11T04:20:03Z

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines · 2024-07-11T04:20:09Z

Azure Pipelines successfully started running 10 pipeline(s).

azure-pipelines · 2024-07-11T04:20:18Z

Azure Pipelines successfully started running 10 pipeline(s).

yufenglee · 2024-07-11T21:19:56Z

@duanqn, thank you very much for your contribution, Qingnan!

duanqn commented May 27, 2024

onnxruntime/contrib_ops/cpu/bert/flash_attention.cc (outdated, resolved)

duanqn commented May 27, 2024

onnxruntime/contrib_ops/cpu/bert/flash_attention.cc (outdated, resolved)

github-advanced-security bot found potential problems May 28, 2024

onnxruntime/contrib_ops/cpu/bert/flash_attention.cc (fixed)

onnxruntime/contrib_ops/cpu/bert/flash_attention.h (fixed)

duanqn force-pushed the qiduan/flash branch from db5fcb2 to f7235b3 Compare June 12, 2024 06:42

tianleiwu reviewed Jun 13, 2024

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc (outdated, resolved)

tianleiwu reviewed Jun 13, 2024

onnxruntime/core/mlas/inc/mlas_flashattn.h (outdated, resolved)

tianleiwu reviewed Jun 13, 2024

onnxruntime/core/mlas/inc/mlas_flashattn.h (outdated, resolved)

tianleiwu reviewed Jun 14, 2024

onnxruntime/test/python/transformers/benchmark_mha.py (outdated, resolved)

tianleiwu reviewed Jun 14, 2024

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc (outdated, resolved)

duanqn force-pushed the qiduan/flash branch from 37d2325 to c8c12ff Compare June 17, 2024 03:11

tianleiwu reviewed Jun 18, 2024

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc (outdated, resolved)

duanqn commented Jun 19, 2024

onnxruntime/core/mlas/inc/mlas_flashattn.h (outdated, resolved)

github-advanced-security bot found potential problems Jun 19, 2024

tianleiwu reviewed Jun 19, 2024

onnxruntime/core/mlas/lib/flashattn.cpp (outdated, resolved)

tianleiwu mentioned this pull request Jun 20, 2024: [CPU] SparseAttention op #21110 (merged)

duanqn force-pushed the qiduan/flash branch 2 times, most recently from f858430 to 599ac3f Compare June 20, 2024 07:34

tianleiwu marked this pull request as ready for review June 20, 2024 17:01

tianleiwu requested a review from a team as a code owner June 20, 2024 17:01

tianleiwu reviewed Jun 21, 2024

onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc (outdated, resolved)

duanqn force-pushed the qiduan/flash branch from 7b82ac5 to 60e2280 Compare June 21, 2024 08:20

duanqn commented Jun 21, 2024

onnxruntime/core/mlas/lib/flashattn.cpp (outdated, resolved)

duanqn and others added 18 commits July 11, 2024 11:44

8b2270a causal=False
b449524 Add MLASCALL on implementation
06251b1 Improve comment
27b18d4 Enable FlashAttention by default
3059b44 lintrunner -a
412f219 Remove memset
44ff8f0 Fix l2_cache_size_
5421335 Fix PREfast
03d8f36 #include <algorithm>
d63e528 Fix bug
7a3d4a6 lintrunner
bf014d0 Renaming
baff456 Renaming
72f3c67 Use MlasSgemmOperation
e8a4373 Move threading inside MLAS kernel
46a8ce9 Remove MLASCALL
e1cf289 Remove 1 TODO
852fd98 Renaming

duanqn force-pushed the qiduan/flash branch from 5554531 to 852fd98 Compare July 11, 2024 03:45

tianleiwu approved these changes Jul 11, 2024

tianleiwu requested a review from yufenglee July 11, 2024 19:00

yufenglee approved these changes Jul 11, 2024

yufenglee merged commit 80b56fe into microsoft:main Jul 11, 2024
86 of 88 checks passed
